CpG Methylation Analysis using M3D

M3D is an R package that distinguishes DMR (differential methylation regions) by kernel methods. (SVM?) Promoter regions of housekeeping genes and its CpG islands are known to be less methylated. This makes a difference in gene expression regulation. Therefore, promoter region’s degree of methylation is an indicator of gene expression.

Also, RRBS (Reduced Representation Bisulfite Sequencing) is an efficient experiment for measuring methylation of dense CpG regions’ cytosines.

  1. Digestion: MspI is used to digest genomic DNA, unbiased towards CpG methylation. It targets 5’CCGG3’ sequences and cleaves phosphodiester bonds upstream of CpG dinucleotide. Each fragment has a CpG at the end then.

  2. End repair and A-tailing: MspI cleaves double stranded DNA and results in sticky ends. End-repair is required. And then add A-tail to both plus and minus strands at the 3’ terminal. It’s necessary for adapter ligation. (dATPs are added in excess to increase efficiency.)

  3. Sequence adapters: Methylated sequence adapters (oligonucleotides) -> All cytosines are replaced with 5’methyl-cytosines.

  4. Fragment purification: Gel electrophoresis and select for 40-220 bp length fragments that are representiative of the majority of promoter sequences and CpG islands.

  5. Bisulfite conversion: Application of bisulfite results in the deamination of unmethylated cytosines into uracils.

  6. PCR amplification

  7. PCR purification

  8. Sequencing

Statistical analysis includes β binomial distribution hypothesis testing for CpG loci, and individual CpGs are connected in order to produce an estimated values of differentially methylated regions (DMR).

Install Bioconductor Manager, M3D, BiSeq

my_packages <- c("M3D", "BiSeq")
lapply(my_packages, library, character.only=TRUE)
[[1]]
 [1] "BiSeq"                "Formula"              "SummarizedExperiment" "DelayedArray"        
 [5] "BiocParallel"         "matrixStats"          "Biobase"              "GenomicRanges"       
 [9] "GenomeInfoDb"         "IRanges"              "S4Vectors"            "BiocGenerics"        
[13] "parallel"             "stats4"               "M3D"                  "stats"               
[17] "graphics"             "grDevices"            "utils"                "datasets"            
[21] "methods"              "base"                

[[2]]
 [1] "BiSeq"                "Formula"              "SummarizedExperiment" "DelayedArray"        
 [5] "BiocParallel"         "matrixStats"          "Biobase"              "GenomicRanges"       
 [9] "GenomeInfoDb"         "IRanges"              "S4Vectors"            "BiocGenerics"        
[13] "parallel"             "stats4"               "M3D"                  "stats"               
[17] "graphics"             "grDevices"            "utils"                "datasets"            
[21] "methods"              "base"                
group <- factor(c('H1-hESC', 'H1-hESC', 'K562', 'K562'))
samples <- c('H1-hESC1', 'H1-hESC2', 'K562-1', 'K562-2')
colData <- data.frame(group, row.names = samples)
colData
ABCDEFGHIJ0123456789
 
 
group
<fctr>
H1-hESC1H1-hESC
H1-hESC2H1-hESC
K562-1K562
K562-2K562
levels(colData$group)
[1] "H1-hESC" "K562"   
class(colData)
[1] "data.frame"

Note that data.frame and DataFrame are different. data.frame is a built-in data structure, and DataFrame is a function in S4Vectors and utilized by Bioconductor.

colData <- DataFrame(group, row.names = samples)
class(colData)
[1] "DFrame"
attr(,"package")
[1] "S4Vectors"

DataFrame is a data structure included in Bioconductor. Refer to this link: Bioconductor DataFrame explanation

The file that will be used in this session are ENCODE RRBS data of H1-hESC and K562 cell lines. H1-hESC cell’s two samples are two be compared to two samples of K562 leukemia cell. Download the necessary bed file from UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibMethylRrbs/)

A BED file (.bed) is a tab-delimited text file that defines a feature track. It can have any file extension, but .bed is recommended. The BED file format is described on the UCSC Genome Bioinformatics web site: http://genome.ucsc.edu/FAQ/FAQformat. Tracks in the UCSC Genome Browser (http://genome.ucsc.edu/) can be downloaded to BED files and loaded into IGV.

path = "/home/seungho/GDA/PublicENCODE" # path to the directory
setwd(path)
The working directory was changed to /home/seungho/GDA/PublicENCODE inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
files <- c('wgEncodeHaibMethylRrbsH1hescHaibSitesRep1.bed.gz', 'wgEncodeHaibMethylRrbsH1hescHaibSitesRep2.bed.gz', 'wgEncodeHaibMethylRrbsK562HaibSitesRep1.bed.gz', 'wgEncodeHaibMethylRrbsK562HaibSitesRep2.bed.gz')
rrbs <- readBedFiles(files, colData, bed_type = 'Encode')
Reading in files
Processing sample H1-hESC1 ... 
Read 1288079 records
Begin sorting data
Data sort complete
Processing sample H1-hESC2 ... 
Read 1302577 records
Begin sorting data
Data sort complete
Processing sample K562-1 ... 
Read 1275818 records
Begin sorting data
Data sort complete
Processing sample K562-2 ... 
Read 1114927 records
Begin sorting data
Data sort complete
Building BSraw object.

BED format file is, again, a file format that you are able to download through UCSC’s ENCODE data portal. It is tab-delimited and includes next attributes:

the chromosome name, start coordinate, end coordinate, some identifier, read coverage (#T), strand, start and end coordinates again(we discard this duplicated information), some color value, read coverage(#T) and the methylation percentage.

BED format file must have a header row, and the coordinate reference for the locus is 0.

readBedFile is a function that resides in the namespace of M3D. The following is its code.

function (files, colData, bed_type = "Encode", eData = NaN) 
{
    if (nrow(colData) != length(files)) {
        stop("Row number of colData must equal length of files.")
    }
    if (!test_colData_structure(colData)) {
        stop("Files must be specified group by group. Please re-make colData to\n             show all of one group, then all of the next, etc.")
    }
    message("Reading in files\n")
    methData <- readBedRaw(files, colData, bed_type = bed_type)
    cat("Building BSraw object.\n")
    fData <- methData[[1]]
    if (length(methData) > 1) {
        for (i in seq(along = methData)[-1]) {
            fData <- unique(c(fData, methData[[i]]))
        }
    }
    elementMetadata(fData) <- NULL
    names(fData) <- as.character(1:length(fData))
    tReads <- matrix(integer(length = length(fData) * length(methData)), 
        nrow = length(fData))
    mReads <- matrix(integer(length = length(fData) * length(methData)), 
        nrow = length(fData))
    for (i in seq(along = methData)) {
        mtch <- findOverlaps(fData, methData[[i]])
        mtch.m <- as.matrix(mtch)
        ind1 <- mtch.m[, 1]
        ind2 <- mtch.m[, 2]
        tReads[ind1, i] <- elementMetadata(methData[[i]])$reads[ind2]
        mReads[ind1, i] <- elementMetadata(methData[[i]])$methylated[ind2]
    }
    colnames(tReads) <- rownames(colData)
    colnames(mReads) <- rownames(colData)
    rownames(tReads) <- names(fData)
    rownames(mReads) <- names(fData)
    if (is.na(eData)) {
        eData <- SimpleList(exp = "Demo")
    }
    rrbs = BSraw(exptData = eData, colData = colData, rowRanges = fData, 
        totalReads = tReads, methReads = mReads)
    return(rrbs)
}
<bytecode: 0x55df317be410>
<environment: namespace:M3D>

Next, use BiSeq package to generate GRanges file for the region. BiSeq package is built for processing bisulfite-conversion sequenced data.

rrbs.CpGs <- clusterSites(object = rrbs, groups = unique(colData(rrbs)$group), perc.samples = 3/4, min.sites = 20, max.dist = 100)
CpGs <- clusterSitesToGR(rrbs.CpGs)

In order to understand the next step, you gotta know what coverage (or read depth) is. Coverage in DNA sequencing is the number of unique reads that include a given nucleotide in the reconstructed sequence. Deep sequencing aims at higher coverage of each unique reads.

Now, we cut off the regions with coverage (read depth) less than 100. totalReads(rrbs.CpGs) is a matrix that contains position and total coverage of each position.

In the following code, overlaps is

> overlaps
Hits object with 576106 hits and 0 metadata columns:
           queryHits subjectHits
           <integer>   <integer>
       [1]         1           1
       [2]         1           2
       [3]         1           3
       [4]         1           4
       [5]         1           5
       ...       ...         ...
  [576102]     14453      576102
  [576103]     14453      576103
  [576104]     14453      576104
  [576105]     14453      576105
  [576106]     14453      576106
  -------

and

subjectHits(overlaps[ queryHits(overlaps)== 1])
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36

subjectHits brings indices of where that number of overlap is.

inds <- vector()
overlaps <- findOverlaps(CpGs, rowRanges(rrbs.CpGs))
for (i in 1:length(CpGs)){
  covs <- colSums(totalReads(rrbs.CpGs)[subjectHits(overlaps[ queryHits(overlaps) == i]),])
  if (!any(covs <= 100)){
    inds <- c(inds, i)
  }
}
CpGs <- CpGs[inds]
rm(inds)

For next analysis, use only 1000 regions. Update overlaps object and explain correct overlaps.

rrbs <- rrbs.CpGs
CpGs <- CpGs[1:1000]
overlaps <- findOverlaps(CpGs, rowRanges(rrbs))
rrbs <- rrbs[subjectHits(overlaps)]
overlaps <- findOverlaps(CpGs, rowRanges(rrbs))

Use the two data in ENCODE consortium H1-hESC and K562 as the two RRBS files.

Use BiSeq package’s functions clusterSites and clusterSitesToGR for clustering.

Delete islands of all samples that have less than 100? and include only the first thousand.

Analysis is stored in GRanges and each item shows a region’s start and end.

rrbs
class: BSraw 
dim: 39907 4 
metadata(0):
assays(2): totalReads methReads
rownames(39907): 643464 643465 ... 687062 687063
rowData names(1): cluster.id
colnames(4): H1-hESC1 H1-hESC2 K562-1 K562-2
colData names(1): group
CpGs
GRanges object with 1000 ranges and 1 metadata column:
         seqnames              ranges strand | cluster.id
            <Rle>           <IRanges>  <Rle> |   <factor>
     [1]     chr1       840131-840357      * |     chr1_1
     [2]     chr1       845246-845561      * |     chr1_2
     [3]     chr1       858978-859375      * |     chr1_3
     [4]     chr1       859529-859727      * |     chr1_4
     [5]     chr1       875730-875953      * |     chr1_5
     ...      ...                 ...    ... .        ...
   [996]     chr1 168105869-168106338      * |  chr1_1009
   [997]     chr1 168148116-168148585      * |  chr1_1010
   [998]     chr1 168194959-168195245      * |  chr1_1011
   [999]     chr1 169075431-169075641      * |  chr1_1012
  [1000]     chr1 170633423-170633656      * |  chr1_1013
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths

Maximum Mean Discrepancy (MMD) and Coverage

M3D analysis uses MMD values for the entire methylation data and the applied regions. Both can be obtained through M3D_Wrapper function. Sample pairs’ entire methylated MMD matrix is the return value of M3D_Wrapper. The second result is equivalent coverage. MMDlist$Full and MMDlist$Coverage variables are those. Substitute the difference between the two values into M3Dstat, and this shows structure of each items.

head(MMDlist$Full)
     H1-hESC1  vs  H1-hESC2 H1-hESC1  vs  K562-1
[1,]             0.13328796            0.3711302
[2,]             0.15843597            0.1991325
[3,]             0.11149443            0.1376077
[4,]             0.05180868            0.1450703
[5,]             0.17830486            0.2438087
[6,]             0.02382480            0.0972950
     H1-hESC1  vs  K562-2 H1-hESC2  vs  K562-1 H1-hESC2  vs  K562-2
[1,]            0.3882027           0.23581746           0.22551184
[2,]            0.1901730           0.14194300           0.14213741
[3,]            0.1476828           0.03963481           0.05186073
[4,]            0.1196117           0.11307940           0.07325263
[5,]            0.1983181           0.07915218           0.10918123
[6,]            0.1378252           0.09148931           0.13047384
     K562-1  vs  K562-2
[1,]         0.06755053
[2,]         0.04260434
[3,]         0.01802223
[4,]         0.05798640
[5,]         0.09577036
[6,]         0.05080778
head(MMDlist$Coverage)
     H1-hESC1  vs  H1-hESC2 H1-hESC1  vs  K562-1 H1-hESC1  vs  K562-2 H1-hESC2  vs  K562-1 H1-hESC2  vs  K562-2 K562-1  vs  K562-2
[1,]             0.13245172            0.3979882            0.3998015           0.26189520           0.23546355         0.06338745
[2,]             0.21273851            0.1709892            0.1664227           0.05580718           0.07902125         0.04269711
[3,]             0.11562054            0.1408935            0.1493031           0.04014058           0.04976709         0.01636765
[4,]             0.05353733            0.1461974            0.1205286           0.11230110           0.07220178         0.05731698
[5,]             0.17977249            0.2456640            0.2034446           0.07999092           0.11754403         0.09997088
[6,]             0.02360063            0.0988395            0.1365794           0.09302259           0.12916722         0.04774588
M3Dstat <- MMDlist$Full - MMDlist$Coverage

Below is a scatterplot for comparison among replicates of Full MMD and Coverage MMD.

repCols <- c(1, 6)
plot(as.vector(MMDlist$Full[, repCols]), as.vector(MMDlist$Coverage[, repCols]), xlab = 'Full MMD', ylab = 'Coverage MMD', main = 'Test Statistic: Replicate Comparison')
abline(a = 0, b = 1, col='red', lwd=2)

groupCols <- 2:5
plot(as.vector(MMDlist$Full[, groupCols]), as.vector(MMDlist$Coverage[, groupCols]), xlab = 'Full MMD', ylab = 'Coverage MMD', main = 'Test Statistics: Inter-Group Comparison')
abline(a=0, b=1, col='red', lwd=2)

The following code prints M3D analysis result as histograms.

repCols <- c(1:6)
groupCols <- 2:5
M3Dstat <- MMDlist$Full - MMDlist$Coverage
breaks <- seq(-0.2, 1.2, 0.1)
hReps <- hist(M3Dstat[, repCols], breaks = breaks, plot = FALSE)
hGroups <- hist(rowMeans(M3Dstat[, groupCols]), breaks = breaks, plot = FALSE)
col1 <- "#0000FF40"
col2 <- "#FF000040"
plot(hReps, col = col1, freq = FALSE, ylab = "Density", xlab = "MMD statistic", main = "M3D Stat Histogram")
plot(hGroups, col = col2, freq = FALSE, add = TRUE)
legend(x = 0.5, y = 3, legend = c("Replicates", "Groups"), fill = c(col1, col2))

Next is histone modification data search in the EpiFactors database.

---
title: "M3D_practice"
output: html_notebook
---
# CpG Methylation Analysis using M3D
M3D is an R package that distinguishes DMR (differential methylation regions) by kernel methods. (SVM?)
Promoter regions of housekeeping genes and its CpG islands are known to be less methylated. This makes a difference in gene expression regulation. Therefore, promoter region's degree of methylation is an indicator of gene expression.

Also, RRBS (Reduced Representation Bisulfite Sequencing) is an efficient experiment for measuring methylation of dense CpG regions' cytosines.

1. Digestion: MspI is used to digest genomic DNA, unbiased towards CpG methylation. It targets 5'CCGG3' sequences and cleaves phosphodiester bonds upstream of CpG dinucleotide. Each fragment has a CpG at the end then.

2. End repair and A-tailing: MspI cleaves double stranded DNA and results in sticky ends. End-repair is required. And then add A-tail to both plus and minus strands at the 3' terminal. It's necessary for adapter ligation. (dATPs are added in excess to increase efficiency.)

3. Sequence adapters: Methylated sequence adapters (oligonucleotides) -> All cytosines are replaced with 5'methyl-cytosines.

4. Fragment purification: Gel electrophoresis and select for 40-220 bp length fragments that are representiative of the majority of promoter sequences and CpG islands.

5. Bisulfite conversion: Application of bisulfite results in the deamination of unmethylated cytosines into uracils.

6. PCR amplification

7. PCR purification

8. Sequencing

![](https://upload.wikimedia.org/wikipedia/commons/f/ff/Reduced_Representation_Bisulfite_Sequencing_Protocol.jpg)
 
Statistical analysis includes $\beta$ binomial distribution hypothesis testing for CpG loci, and individual CpGs are connected in order to produce an estimated values of differentially methylated regions (DMR).

## Install Bioconductor Manager, M3D, BiSeq
```{r, echo=FALSE}
install.packages("BiocManager")
BiocManager::install("M3D")
BiocManager::install("BiSeq")
```


```{r}
my_packages <- c("M3D", "BiSeq")
lapply(my_packages, library, character.only=TRUE)
```
```{r}
group <- factor(c('H1-hESC', 'H1-hESC', 'K562', 'K562'))
samples <- c('H1-hESC1', 'H1-hESC2', 'K562-1', 'K562-2')
colData <- data.frame(group, row.names = samples)
colData
```

```{r}
levels(colData$group)
```
```{r}
class(colData)
```
Note that `data.frame` and `DataFrame` are different. `data.frame` is a built-in data structure, and `DataFrame` is a function in S4Vectors and utilized by Bioconductor.

```{r}
colData <- DataFrame(group, row.names = samples)
class(colData)
```
DataFrame is a data structure included in Bioconductor. Refer to this link:
[Bioconductor DataFrame explanation](https://www.bioconductor.org/help/course-materials/2019/BiocDevelForum/02-DataFrame.pdf)

The file that will be used in this session are ENCODE RRBS data of H1-hESC and K562 cell lines. H1-hESC cell's two samples are two be compared to two samples of K562 leukemia cell. Download the necessary bed file from UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibMethylRrbs/)

A BED file (.bed) is a tab-delimited text file that defines a feature track. It can have any file extension, but .bed is recommended. The BED file format is described on the UCSC Genome Bioinformatics web site: http://genome.ucsc.edu/FAQ/FAQformat. Tracks in the UCSC Genome Browser (http://genome.ucsc.edu/) can be downloaded to BED files and loaded into IGV.

```{r}
path = "/home/seungho/GDA/PublicENCODE" # path to the directory
setwd(path)
files <- c('wgEncodeHaibMethylRrbsH1hescHaibSitesRep1.bed.gz', 'wgEncodeHaibMethylRrbsH1hescHaibSitesRep2.bed.gz', 'wgEncodeHaibMethylRrbsK562HaibSitesRep1.bed.gz', 'wgEncodeHaibMethylRrbsK562HaibSitesRep2.bed.gz')
rrbs <- readBedFiles(files, colData, bed_type = 'Encode')
```
BED format file is, again, a file format that you are able to download through UCSC's ENCODE data portal. It is tab-delimited and includes next attributes:

the chromosome name, start coordinate, end coordinate, some identifier, read coverage (#T), strand, start and end coordinates again(we discard this duplicated information), some color value, read coverage(#T) and the methylation percentage.

BED format file must have a header row, and the coordinate reference for the locus is 0.

`readBedFile` is a function that resides in the namespace of M3D. The following is its code.

```
function (files, colData, bed_type = "Encode", eData = NaN) 
{
    if (nrow(colData) != length(files)) {
        stop("Row number of colData must equal length of files.")
    }
    if (!test_colData_structure(colData)) {
        stop("Files must be specified group by group. Please re-make colData to\n             show all of one group, then all of the next, etc.")
    }
    message("Reading in files\n")
    methData <- readBedRaw(files, colData, bed_type = bed_type)
    cat("Building BSraw object.\n")
    fData <- methData[[1]]
    if (length(methData) > 1) {
        for (i in seq(along = methData)[-1]) {
            fData <- unique(c(fData, methData[[i]]))
        }
    }
    elementMetadata(fData) <- NULL
    names(fData) <- as.character(1:length(fData))
    tReads <- matrix(integer(length = length(fData) * length(methData)), 
        nrow = length(fData))
    mReads <- matrix(integer(length = length(fData) * length(methData)), 
        nrow = length(fData))
    for (i in seq(along = methData)) {
        mtch <- findOverlaps(fData, methData[[i]])
        mtch.m <- as.matrix(mtch)
        ind1 <- mtch.m[, 1]
        ind2 <- mtch.m[, 2]
        tReads[ind1, i] <- elementMetadata(methData[[i]])$reads[ind2]
        mReads[ind1, i] <- elementMetadata(methData[[i]])$methylated[ind2]
    }
    colnames(tReads) <- rownames(colData)
    colnames(mReads) <- rownames(colData)
    rownames(tReads) <- names(fData)
    rownames(mReads) <- names(fData)
    if (is.na(eData)) {
        eData <- SimpleList(exp = "Demo")
    }
    rrbs = BSraw(exptData = eData, colData = colData, rowRanges = fData, 
        totalReads = tReads, methReads = mReads)
    return(rrbs)
}
<bytecode: 0x55df317be410>
<environment: namespace:M3D>
```

Next, use BiSeq package to generate GRanges file for the region. BiSeq package is built for processing bisulfite-conversion sequenced data.

```{r}
rrbs.CpGs <- clusterSites(object = rrbs, groups = unique(colData(rrbs)$group), perc.samples = 3/4, min.sites = 20, max.dist = 100)
```

```{r}
CpGs <- clusterSitesToGR(rrbs.CpGs)
```

In order to understand the next step, you gotta know what **coverage** (or **read depth**) is. Coverage in DNA sequencing is the number of unique reads that include a given nucleotide in the reconstructed  sequence. Deep sequencing aims at higher coverage of each unique reads.

Now, we cut off the regions with coverage (read depth) less than 100. `totalReads(rrbs.CpGs)` is a matrix that contains position and total coverage of each position.

In the following code, `overlaps` is
```
> overlaps
Hits object with 576106 hits and 0 metadata columns:
           queryHits subjectHits
           <integer>   <integer>
       [1]         1           1
       [2]         1           2
       [3]         1           3
       [4]         1           4
       [5]         1           5
       ...       ...         ...
  [576102]     14453      576102
  [576103]     14453      576103
  [576104]     14453      576104
  [576105]     14453      576105
  [576106]     14453      576106
  -------
```
and 
```
subjectHits(overlaps[ queryHits(overlaps)== 1])
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
```
`subjectHits` brings indices of where that number of overlap is.

```{r}
inds <- vector()
overlaps <- findOverlaps(CpGs, rowRanges(rrbs.CpGs))
for (i in 1:length(CpGs)){
  covs <- colSums(totalReads(rrbs.CpGs)[subjectHits(overlaps[ queryHits(overlaps) == i]),])
  if (!any(covs <= 100)){
    inds <- c(inds, i)
  }
}
CpGs <- CpGs[inds]
rm(inds)
```

For next analysis, use only 1000 regions. Update `overlaps` object and explain correct `overlaps`.

```{r}
rrbs <- rrbs.CpGs
CpGs <- CpGs[1:1000]
overlaps <- findOverlaps(CpGs, rowRanges(rrbs))
rrbs <- rrbs[subjectHits(overlaps)]
overlaps <- findOverlaps(CpGs, rowRanges(rrbs))
```


Use the two data in ENCODE consortium `H1-hESC` and `K562` as the two RRBS files.

Use BiSeq package's functions `clusterSites` and `clusterSitesToGR` for clustering.

Delete islands of all samples that have less than 100? and include only the first thousand.

Analysis is stored in GRanges and each item shows a region's start and end.

```{r}
rrbs
```

```{r}
CpGs
```
## Maximum Mean Discrepancy (MMD) and Coverage

M3D analysis uses MMD values for the entire methylation data and the applied regions. Both can be obtained through `M3D_Wrapper` function. Sample pairs' entire methylated MMD matrix is the return value of `M3D_Wrapper`. The second result is equivalent coverage. `MMDlist$Full` and `MMDlist$Coverage` variables are those. Substitute the difference between the two values into M3Dstat, and this shows structure of each items.

```{r, include=FALSE}
MMDlist <- M3D_Wrapper(rrbs, overlaps)
```

```{r}
head(MMDlist$Full)
```

```{r}
head(MMDlist$Coverage)
```

```{r}
M3Dstat <- MMDlist$Full - MMDlist$Coverage
```

Below is a scatterplot for comparison among replicates of Full MMD and Coverage MMD.
```{r}
repCols <- c(1, 6)
plot(as.vector(MMDlist$Full[, repCols]), as.vector(MMDlist$Coverage[, repCols]), xlab = 'Full MMD', ylab = 'Coverage MMD', main = 'Test Statistic: Replicate Comparison')
abline(a = 0, b = 1, col='red', lwd=2)
```
```{r}
groupCols <- 2:5
plot(as.vector(MMDlist$Full[, groupCols]), as.vector(MMDlist$Coverage[, groupCols]), xlab = 'Full MMD', ylab = 'Coverage MMD', main = 'Test Statistics: Inter-Group Comparison')
abline(a=0, b=1, col='red', lwd=2)
```
The following code prints M3D analysis result as histograms.
```{r}
repCols <- c(1:6)
groupCols <- 2:5
M3Dstat <- MMDlist$Full - MMDlist$Coverage
breaks <- seq(-0.2, 1.2, 0.1)
hReps <- hist(M3Dstat[, repCols], breaks = breaks, plot = FALSE)
hGroups <- hist(rowMeans(M3Dstat[, groupCols]), breaks = breaks, plot = FALSE)
col1 <- "#0000FF40"
col2 <- "#FF000040"
plot(hReps, col = col1, freq = FALSE, ylab = "Density", xlab = "MMD statistic", main = "M3D Stat Histogram")
plot(hGroups, col = col2, freq = FALSE, add = TRUE)
legend(x = 0.5, y = 3, legend = c("Replicates", "Groups"), fill = c(col1, col2))
```

Next is histone modification data search in the EpiFactors database.